Tether’s open-source TurboQuant release compresses the memory AI needs during long sessions, letting laptops, phones, edge devices, and decentralized networks handle larger documents, longer conversations, codebases, and personal AI assistants without sending everything to the cloud
1 June 2026 – Tether’s AI Research Group today announced the production release of its open-source implementation of TurboQuant, the Google Research memory compression algorithm that drew comparisons to “Pied Piper” from Silicon Valley for its ability to dramatically reduce the memory large AI models need to run. With TurboQuant, Google made a research breakthrough. Tether is bringing it to life in production through its open-source local/edge AI engine, QVAC Fabric. Fabric started as a llama.cpp fork and now incorporates several breakthroughs that push the boundaries of local on-device intelligence.
The release turns TurboQuant from a paper into open-source software that developers can use, test, and adapt across laptops, consumer GPUs, mobile chips, edge devices, and decentralized inference networks. It includes a full quantization pipeline, adapters for common inference frameworks, developer documentation, and workload-tuned profiles designed for real deployment outside hyperscale data centers. The change matters because memory is one of the biggest reasons useful AI tasks still get pushed to the cloud.
When someone uses an AI assistant, the model needs memory not only to load but also to remember the conversation, document, codebase, or instructions it has already seen. That working memory is called the KV cache, and it grows as the session gets longer. A short prompt may be easy to handle. A full contract, financial filing, research report, book, code repository, or several hours of conversation can push memory requirements beyond what most laptops, phones, and consumer GPUs can support.
At roughly 262,000 tokens, the scale of several hours of conversation or a few hundred pages of text, the KV cache for a 4B model can use about 8 GB of memory on its own. Four sessions at that size can push the cache alone to around 32 GB before accounting for the memory needed to load the model itself. That is why many AI experiences still rely on remote data centers, even when users would prefer to keep their work local.
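As a rough illustration of where a figure like 8 GB can come from, the sketch below applies the standard KV cache sizing formula: two cached vectors (one key, one value) per layer and per KV head for every token. The layer count, KV head count, head dimension, and 16-bit storage are hypothetical values chosen only so the arithmetic lands on the figure quoted above. They are not the configuration of any specific 4B model, and real models of that size can differ substantially.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for `tokens` positions."""
    # Factor of 2: one key vector and one value vector
    # per layer, per KV head, per token.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


GIB = 1024 ** 3

# Roughly 262,000 tokens, taken here as 2**18.
TOKENS = 262_144

# ASSUMPTION: hypothetical model shape, picked to reproduce the ~8 GB figure.
per_session = kv_cache_bytes(TOKENS, layers=32, kv_heads=2, head_dim=128)
four_sessions = 4 * per_session

print(f"one session:   {per_session / GIB:.1f} GiB")
print(f"four sessions: {four_sessions / GIB:.1f} GiB")
```

The cache grows linearly with the number of tokens, so doubling the session length doubles the memory under these assumptions, independent of the memory needed for the model weights.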
TurboQuant changes that equation by compressing the KV cache up to 5x while maintaining output quality close to an uncompressed model. In practical terms, this means local AI can handle longer conversations, larger files, more context, and heavier workloads on the hardware people already own.
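To put the compression figure in concrete terms, the sketch below works through the best case using the numbers quoted in this release: about 8 GB of cache per long session and a 5x reduction. The 16 GB memory budget and the 16-bit uncompressed baseline are assumptions made for illustration, and "up to 5x" means actual savings may be smaller.

```python
GB = 10 ** 9

# Figures from the release: ~8 GB of KV cache per long session, up to 5x smaller.
uncompressed = 8 * GB
ratio = 5  # best case

compressed = uncompressed // ratio

# ASSUMPTION: a hypothetical 16 GB of device memory left over for the cache.
budget = 16 * GB
sessions_before = budget // uncompressed
sessions_after = budget // compressed

# ASSUMPTION: the uncompressed cache stores 16-bit values.
bits_per_value = 16 / ratio

print(f"cache per session: {uncompressed / GB:.1f} GB -> {compressed / GB:.1f} GB")
print(f"sessions in a 16 GB budget: {sessions_before} -> {sessions_after}")
print(f"average bits per cached value: {bits_per_value:.1f}")
```

Under these assumptions, the same memory that held two long sessions holds ten, or equivalently a single session can run about five times longer before hitting the same limit.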
For users, this can mean asking an AI assistant on a laptop to read and analyze a hundred-page legal document without uploading the full file to a cloud provider. It can mean a student using an on-device tutor that retains an entire study session rather than losing context after a few messages. It can mean a developer running a local coding assistant that understands more of a codebase at once. It can mean a journalist, doctor, researcher, or small business owner using AI on sensitive files while keeping more of that work on the device.
For developers and startups, it means larger AI products can be built without assuming access to expensive GPU clusters. Instead of designing around short context windows, strict memory limits, or cloud-only deployment, teams can use TurboQuant to support longer sessions, larger workloads, and more flexible deployment across consumer hardware, edge devices, and peer-to-peer networks.
“Google’s research showed that AI memory could be compressed far more efficiently than most people assumed. Our work brings that breakthrough into production software that developers, startups, and users can actually build with,” said Paolo Ardoino, CEO of Tether. “If long context AI only works inside the largest data centers, then AI will be shaped by whoever owns the most hardware. TurboQuant changes what local AI can do by making memory less of a wall.”
“People should be able to ask an AI assistant to read a long document, remember a project, help with code, or work through private information without every task being forced through a remote data center,” he added. “This is what bringing TurboQuant to production makes possible. It gives local AI more memory, more context, and more room to become useful in everyday life.”
Tether’s implementation is designed for environments where production AI often runs into limits: constrained device memory, mixed hardware, long sessions, latency pressure, and deployment outside centralized cloud infrastructure. Rather than requiring teams to rebuild the research themselves, the open-source release provides the AI developer community with a shared foundation for testing, improving, and adapting TurboQuant across different systems.
TurboQuant will be included in QVAC SDK 0.12.0, making it available directly through Fabric, one of the SDK’s core building blocks. The QVAC SDK is the recommended integration path for developers building within Tether’s AI ecosystem, and it brings together the full set of QVAC tools, libraries, and runtime components needed to build local AI applications across devices and environments.
The release also advances Tether’s broader AI strategy. The company is building toward AI that can operate closer to users, across personal devices, local networks, and decentralized infrastructure, rather than relying solely on centralized APIs and hyperscale data centers. Large compute will remain important, but Tether believes the next phase of AI will also be defined by software efficiency, portability, and the ability to run capable models where people actually use them.